WHAT IS CLAIMED IS: 



1. A method for building a decision tree from an input data set, the
input data set comprising records and associated attributes, the attributes
including a class label attribute for indicating whether a given record is a member
of a target class or a non-target class, the input data set being biased in favor of
the records of the non-target class, the decision tree comprising a plurality of
nodes that include a root node and leaf nodes, said method comprising the steps
of:

constructing the decision tree from the input data set, including the step of
partitioning each of the plurality of nodes of the decision tree, beginning with the
root node, based upon multivariate subspace splitting criteria;

computing distance functions for each of the leaf nodes;

identifying, with respect to the distance functions, a nearest neighbor set
of nodes for each of the leaf nodes based upon a respective closeness of the
nearest neighbor set of nodes to a target record of the target class; and

classifying and scoring the records, based upon the decision tree and the
nearest neighbor set of nodes.



2. The method of claim 1, wherein said constructing step comprises
the steps of:

forming a plurality of pre-sorted attribute lists, each of the plurality of
pre-sorted attribute lists corresponding to one of the attributes other than the
class label attribute; and

constructing the root node to include the plurality of pre-sorted attribute
lists.

3. The method of claim 2, wherein said forming step comprises the
step of forming each of the plurality of pre-sorted attribute lists to include a
plurality of entries, each of the plurality of entries comprising a record id for
identifying a record associated with the corresponding one of the attributes, a
value of the corresponding one of the attributes, and a value of the class label
attribute associated with the record.
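
By way of illustration only, the entry layout recited in claim 3 might be sketched in Python as follows. The names `Entry` and `build_attribute_lists`, the dictionary-based record layout, and the `label` key are assumptions made for this sketch and do not appear in the claims.

```python
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    record_id: int   # identifies the record the entry belongs to
    value: float     # value of the attribute for that record
    label: int       # class label (assumed: 1 = target, 0 = non-target)


def build_attribute_lists(records: List[Dict[str, float]],
                          label_key: str = "label") -> Dict[str, List[Entry]]:
    """Build one value-sorted list per attribute other than the class label."""
    attributes = [k for k in records[0] if k != label_key]
    lists: Dict[str, List[Entry]] = {}
    for attr in attributes:
        entries = [Entry(rid, rec[attr], int(rec[label_key]))
                   for rid, rec in enumerate(records)]
        # Sorting once up front is what makes the lists "pre-sorted".
        lists[attr] = sorted(entries, key=lambda e: e.value)
    return lists


records = [
    {"age": 30, "income": 5, "label": 0},
    {"age": 20, "income": 9, "label": 1},
]
attribute_lists = build_attribute_lists(records)
```

In this sketch the root node would simply hold `attribute_lists`, one list for `age` and one for `income`, each ordered by attribute value.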

4. The method of claim 1, wherein said partitioning step partitions a
current node from among the plurality of nodes of the decision tree, starting with
the root node, until the current node includes only attributes that indicate
membership in a same class.

5. The method of claim 1, wherein said partitioning step partitions a
current node from among the plurality of nodes of the decision tree, starting with
the root node, until the current node includes more than a predetermined
threshold number of attributes that indicate membership in a same class.

6. The method of claim 1, wherein said partitioning step comprises
the step of:

for a current leaf node from among the leaf nodes of the decision tree,
computing a lowest value of a gini index achieved by
univariate-based partitions on each of a plurality of attribute lists included in the
current leaf node.

7. The method of claim 6, wherein the gini index is equal to
1 - (P_n)^2 - (P_p)^2, P_n being a percentage of the records of the non-target
class in the input data set and P_p being a percentage of the records of the
target class in the input data set.

8. The method of claim 6, wherein the percentage of the records P_p
in the input data set is equal to W_p * n_p / (W_p * n_p + n_n), W_p being a
weight of the records of the target class in the input data set, n_p and n_n being
a number of the records of the target class and a number of the records of the
non-target class in the current leaf node, respectively.
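
The quantities recited in claims 6 to 8 might be computed as in the following sketch. It assumes that P_n = 1 - P_p, that a partition is scored by the weighted average of the gini indices of its two sides, and that records whose value is at or below the splitting position go to the left side. None of these assumptions is stated in the claims, and the function names are hypothetical.

```python
def target_fraction(n_p, n_n, w_p=1.0):
    """P_p = W_p * n_p / (W_p * n_p + n_n), as recited in claim 8."""
    total = w_p * n_p + n_n
    return w_p * n_p / total if total else 0.0


def gini(n_p, n_n, w_p=1.0):
    """1 - (P_n)^2 - (P_p)^2, as recited in claim 7."""
    p_p = target_fraction(n_p, n_n, w_p)
    p_n = 1.0 - p_p  # assumption: the two fractions sum to one
    return 1.0 - p_n ** 2 - p_p ** 2


def best_univariate_split(entries, w_p=1.0):
    """Scan one pre-sorted list of (value, label) pairs.

    Returns (lowest gini, splitting position); label 1 marks the target class.
    """
    total_p = sum(1 for _, label in entries if label == 1)
    total_n = len(entries) - total_p
    left_p = left_n = 0
    best = (float("inf"), None)
    for i in range(len(entries) - 1):
        value, label = entries[i]
        if label == 1:
            left_p += 1
        else:
            left_n += 1
        if entries[i + 1][0] == value:
            continue  # no split position between two equal values
        right_p, right_n = total_p - left_p, total_n - left_n
        w_left = w_p * left_p + left_n
        w_right = w_p * right_p + right_n
        g = (w_left * gini(left_p, left_n, w_p)
             + w_right * gini(right_p, right_n, w_p)) / (w_left + w_right)
        if g < best[0]:
            best = (g, value)
    return best
```

Choosing W_p greater than one raises the influence of the rare target class; for example, with one target record, nine non-target records and W_p = 9, the weighted fraction P_p is 0.5.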

9. The method of claim 6, wherein said partitioning step further
comprises the steps of:

detecting subspace clusters of the records of the target class associated
with the current leaf node;

computing the lowest value of the gini index achieved by distance-based
partitions on each of the plurality of attribute lists included in the current leaf
node, the distance-based partitions being based on distances to the detected
subspace clusters;

partitioning pre-sorted attribute lists included in the current node into two
sets of ordered attribute lists based upon a greater one of the lowest value of the
gini index achieved by univariate partitions and the lowest value of the gini index
achieved by distance-based partitions; and

creating new child nodes for each of the two sets of ordered attribute lists.



10. The method of claim 9, wherein said detecting step comprises the
steps of:

computing a minimum support (minsup) of each of the subspace clusters
that have a potential of providing a lower gini index than that provided by the
univariate-based partitions;

identifying one-dimensional clusters of the records of the target class
associated with the current leaf node;

beginning with the one-dimensional clusters, combining centroids of
K-dimensional clusters to form candidate (K+1)-dimensional clusters;

identifying a number of the records of the target class that fall into each of
the (K+1)-dimensional clusters;

pruning any of the (K+1)-dimensional clusters that have a support lower
than the minsup.
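
The bottom-up growth and pruning of claim 10 might be sketched as below. The sketch is a simplification: it represents each cluster as a mapping from dimension name to a value interval rather than by a centroid, and it joins two K-dimensional clusters only when the result has exactly K+1 dimensions. The function name `grow_clusters` and the data layout are assumptions made for illustration.

```python
from itertools import combinations


def grow_clusters(clusters, target_records, n_p, minsup):
    """Form (K+1)-dimensional candidates from K-dimensional clusters and prune.

    clusters: list of dicts {dimension: (low, high)}, each with K dimensions.
    target_records: target-class records of the current leaf, as dicts.
    n_p: total number of target-class records in the current leaf.
    """
    candidates, seen = [], set()
    for a, b in combinations(clusters, 2):
        merged = {**a, **b}
        if len(merged) != len(a) + 1:
            continue  # the join must add exactly one new dimension
        if any(a[d] != b[d] for d in a.keys() & b.keys()):
            continue  # shared dimensions must agree on their interval
        key = frozenset(merged.items())
        if key not in seen:
            seen.add(key)
            candidates.append(merged)

    survivors = []
    for cand in candidates:
        # Count the target-class records falling inside every interval.
        count = sum(all(lo <= r[d] <= hi for d, (lo, hi) in cand.items())
                    for r in target_records)
        if count / n_p >= minsup:  # prune candidates below the minimum support
            survivors.append((cand, count))
    return survivors
```

Calling the function repeatedly on its own survivors would extend the clusters one dimension at a time until no candidate meets the minimum support.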

11. The method of claim 10, wherein the support of a subspace cluster
is denoted as n_p'/n_p, n_p' being a number of the records of the target class in
the subspace cluster, and n_p being a total number of the records of the target
class in the current leaf node.



12. The method of claim 11, wherein the minsup is denoted as
(2q - 2q^2 - G_best)/(2q - 2q^2 - q * G_best), G_best being a smallest gini index
given by the univariate-based partitions, q being n_p/n_n, and n_n being a total
number of the records in the current leaf node.
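
As a worked illustration of claims 11 and 12, the support and the minimum support might be computed as follows. The function names are hypothetical, and the guard against a zero denominator is an addition of this sketch rather than part of the claims.

```python
def support(n_p_cluster, n_p):
    """Support of a subspace cluster: n_p' / n_p (claim 11)."""
    return n_p_cluster / n_p


def minsup(n_p, n_total, g_best):
    """(2q - 2q^2 - G_best) / (2q - 2q^2 - q * G_best), with q = n_p / n_total.

    n_total plays the role of n_n in claim 12: the total number of records
    in the current leaf node.
    """
    q = n_p / n_total
    numerator = 2 * q - 2 * q ** 2 - g_best
    denominator = 2 * q - 2 * q ** 2 - q * g_best
    if denominator == 0:
        raise ValueError("minsup is undefined for these inputs")
    return numerator / denominator


# Example: 10 target records among 100, best univariate gini of 0.1.
threshold = minsup(10, 100, 0.1)      # 0.08 / 0.17, roughly 0.47
keep = support(4, 10) >= threshold    # a cluster holding 4 of 10 is pruned
```

With these numbers q = 0.1, so a cluster would need to hold at least about 47% of the target-class records in the leaf to survive pruning.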







13. The method of claim 10, wherein said step of identifying the
one-dimensional clusters of the records of the target class comprises the steps
of:

dividing a domain of each dimension of a data set associated with the
current leaf node into a predetermined number of equal-length bins;

identifying all of the records of the target class falling into each of the
predetermined number of equal-length bins; and

for each of a current dimension of the data set associated with the current
leaf node,

constructing a histogram for the current dimension; and
identifying clusters of records of the target class on the current
dimension, using the histogram.
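
The binning and histogram steps of claim 13 might look like the following sketch for a single dimension. The claims do not say how clusters are read off the histogram; the sketch assumes a cluster is a maximal run of adjacent bins that each hold at least `min_count` target-class records. That rule, the parameter `min_count`, and the function name are assumptions.

```python
def one_dim_clusters(values, num_bins, min_count):
    """Find one-dimensional clusters of target-class values on one dimension.

    values: attribute values of the target-class records in the current leaf.
    min_count: assumed density threshold per bin; must be at least 1.
    Returns (histogram counts, list of (low, high) cluster intervals).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in bin 0

    # Histogram over equal-length bins; the maximum value goes in the last bin.
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1

    clusters, start = [], None
    for i, c in enumerate(counts + [0]):  # trailing 0 closes a final run
        if c >= min_count and start is None:
            start = i
        elif c < min_count and start is not None:
            clusters.append((lo + start * width, lo + i * width))
            start = None
    return counts, clusters


counts, clusters = one_dim_clusters(
    [0.0, 0.1, 0.2, 5.0, 9.8, 9.9, 10.0], num_bins=5, min_count=2)
```

Here the isolated value 5.0 falls in a sparse bin and is not reported as a cluster, while the dense groups near 0 and near 10 are.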

14. The method according to claim 9, wherein said step of computing
the lowest value of the gini index achieved by distance-based partitions
comprises the steps of:

identifying eligible subspace clusters from among the subspace clusters,
an eligible subspace cluster having a set of clustered dimensions such that only
less than all of the clustered dimensions in the set are capable of being included
in another set of clustered dimensions of another subspace cluster;

selecting top-K clusters from among the eligible subspace clusters, the
top-K clusters being ordered by a number of records therein;

for each of a current top-K cluster,

computing a centroid of the current top-K cluster and a weight on
each dimension of the current top-K cluster; and

computing the gini index of the current top-K cluster, based on a
weighted Euclidean distance to the centroid; and

recording a lowest gini index achieved by said step of computing the gini
index of the current top-K cluster.

15. The method of claim 9, wherein each of the plurality of pre-sorted
attribute lists comprises a plurality of entries, and said step of partitioning the
pre-sorted attribute lists comprises the steps of:

determining whether univariate partitioning or distance-based partitioning
has occurred;

creating a first hash table that maps record ids of any of the records that
satisfy a condition A=v to a left child node and that maps the record ids of any of
the records that do not satisfy the condition A=v to a right child node, A being an
attribute and v denoting a splitting position, when the univariate partitioning has
occurred;

creating a second hash table that maps the record ids of any of the
records that satisfy a condition Dist(d, p, w)=v to a left child node and that maps
the record ids of any of the records that do not satisfy the condition Dist(d, p,
w)=v to a right child node, when the distance-based partitioning has occurred, d
being a record associated with a current subspace cluster, p being a centroid of
the current subspace cluster, and w being a weight on dimensions of the current
subspace cluster;

partitioning the pre-sorted attribute lists into the two sets of ordered
attribute lists, based on information in a corresponding one of the first hash table
or the second hash table;

appending each entry of the two sets of ordered attribute lists to one of
the left child node or the right child node, based on the information in the
corresponding one of the first hash table or the second hash table and
information corresponding to the each entry, to maintain attribute ordering in the
two sets of ordered attribute lists that corresponds to that in the pre-sorted attribute
lists.
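
The hash-table-driven partitioning of claim 15 might be sketched as follows. The split test is passed in as a predicate so that the sketch does not fix the comparison used in the conditions on A and Dist(d, p, w); the usage lines assume a "less than or equal to" test. The function names and the tuple layout (record id, value, label) are hypothetical.

```python
import math


def weighted_distance(d, p, w):
    """Dist(d, p, w): weighted Euclidean distance from record d to centroid p.

    Only the dimensions listed in w contribute.
    """
    return math.sqrt(sum(w[k] * (d[k] - p[k]) ** 2 for k in w))


def build_hash_table(records, goes_left):
    """Map each record id to True (left child) or False (right child)."""
    return {rid: bool(goes_left(rec)) for rid, rec in records.items()}


def split_attribute_lists(attr_lists, table):
    """Split every pre-sorted list into left and right lists, keeping order."""
    left = {attr: [] for attr in attr_lists}
    right = {attr: [] for attr in attr_lists}
    for attr, entries in attr_lists.items():
        for entry in entries:  # entries are visited in sorted order
            side = left if table[entry[0]] else right
            side[attr].append(entry)  # appending preserves that order
    return left, right


records = {0: {"age": 30, "income": 5},
           1: {"age": 20, "income": 9},
           2: {"age": 40, "income": 7}}
labels = {0: 0, 1: 1, 2: 0}
attr_lists = {
    attr: sorted(((rid, rec[attr], labels[rid]) for rid, rec in records.items()),
                 key=lambda e: e[1])
    for attr in ("age", "income")
}

# Univariate case (assumed test: age <= 30).
uni_table = build_hash_table(records, lambda r: r["age"] <= 30)
left, right = split_attribute_lists(attr_lists, uni_table)

# Distance-based case (assumed test: Dist(d, p, w) <= 5.0).
p, w = {"age": 20, "income": 9}, {"age": 1.0, "income": 1.0}
dist_table = build_hash_table(records, lambda r: weighted_distance(r, p, w) <= 5.0)
```

Because each list is scanned once in its existing order and entries are only appended, the child lists need no re-sorting, which is the point of keeping the hash table keyed by record id.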

16. The method of claim 1, wherein said computing step computes
different Euclidean distance functions for at least some of the leaf nodes. 

17. The method of claim 1, wherein said computing step computes
different Euclidean distance functions for each of the leaf nodes.

18. The method of claim 1, wherein said computing step comprises the
steps of:

for a current leaf node from among the leaf nodes of the decision tree,

identifying relevant attributes of the current leaf node;

computing a weight for each of the relevant attributes;

computing a confidence of the current leaf node;

computing a centroid of the records of a majority class in the
current leaf node; and

computing a weight of each relevant dimension of the current leaf node.

19. The method of claim 18, wherein an attribute is relevant when any
node on a path from the root node to the current leaf node one of appears in a
univariate test that splits the current leaf node, appears in a distance function
test with a non-zero weight that splits the current leaf node, and is absent from
any tests but points on the current leaf node are clustered on a given dimension.

20. The method of claim 18, wherein a dimension is relevant when any
node on a path from the root node to the current leaf node one of appears in a
univariate test that splits the current leaf node, appears in a distance function
test with a non-zero weight that splits the current leaf node, and is absent from
any tests but points on the current leaf node are clustered on the dimension.

21. The method of claim 1, wherein said identifying step comprises the
steps of:

for a current leaf node from among the leaf nodes of the decision tree,

computing a maximum distance of the current leaf node between a
centroid of the current leaf node and any of the records that are associated with
the current leaf node;

computing a minimum distance of the current leaf node between
the centroid of the current leaf node and any of the records that are associated
with other leaf nodes;

forming the nearest neighbor set of the current leaf node to consist
of only the other leaf nodes that have a corresponding minimum distance that is
less than the maximum distance of the current node; and

pruning from the nearest neighbor set of the current leaf node any
nodes therein having a minimal bounding rectangle that contains the minimal
bounding rectangle of the current leaf node.
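
The neighbor-set construction of claim 21 might be sketched as below. For brevity the sketch uses an unweighted Euclidean distance and derives each minimal bounding rectangle from the records held by the leaf; both choices, along with the function names and the `leaves` layout, are assumptions of this illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bounding_rectangle(points):
    """Minimal bounding rectangle as (per-dimension lows, per-dimension highs)."""
    columns = list(zip(*points))
    return tuple(min(c) for c in columns), tuple(max(c) for c in columns)


def contains(outer, inner):
    """True when rectangle `outer` encloses rectangle `inner`."""
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    return (all(a <= b for a, b in zip(o_lo, i_lo))
            and all(a >= b for a, b in zip(o_hi, i_hi)))


def nearest_neighbor_set(current, leaves):
    """leaves: {leaf id: list of record points (tuples of equal length)}."""
    points = leaves[current]
    centroid = tuple(sum(col) / len(points) for col in zip(*points))
    max_dist = max(euclidean(centroid, p) for p in points)
    current_box = bounding_rectangle(points)

    neighbors = []
    for leaf_id, other in leaves.items():
        if leaf_id == current:
            continue
        min_dist = min(euclidean(centroid, p) for p in other)
        if min_dist >= max_dist:
            continue  # every record of the other leaf is too far away
        if contains(bounding_rectangle(other), current_box):
            continue  # pruned: its rectangle encloses the current leaf's
        neighbors.append(leaf_id)
    return neighbors


leaves = {
    "A": [(0, 0), (2, 0), (0, 2), (2, 2)],
    "B": [(2.2, 1), (10, 10)],
    "C": [(50, 50), (60, 60)],
    "D": [(-1, -1), (3, 3), (1, 1.5)],
}
neighbors_of_a = nearest_neighbor_set("A", leaves)
```

In this example leaf B is kept, leaf C is rejected on distance, and leaf D is pruned because its rectangle encloses that of leaf A.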

22. The method according to claim 1, wherein said classifying and
scoring step comprises the steps of:

for each of the plurality of nodes of the decision tree, starting at the root
node,

evaluating a Boolean condition and following at least one branch of
the decision tree until a leaf node is reached;

classifying the reached leaf node based on a majority class of any
of the predetermined attributes included therein;

for each node in the nearest neighbor set of nodes for the reached leaf
node,

computing a distance between a record to be scored and a centroid
of the reached leaf node, using a distance function computed for the reached
leaf node; and

scoring the record using a maximum value of a score function, the
score function defined as conf/dist(d, p, w), wherein the conf is a confidence of



YOR9-2001-0363US1 (8728-519) 






the reached node, d is a particular record associated with a current subspace
cluster, p is a centroid of the current subspace cluster, and w is a weight on
dimensions of the subspace cluster.
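
The scoring step of claim 22 might be sketched as follows. The sketch assumes that the score is taken as the maximum of conf/dist over the reached leaf and the nodes in its nearest neighbor set, each evaluated with its own centroid and weights, and it clamps the distance with a small `eps` to avoid division by zero. The node layout (keys `conf`, `centroid`, `weights`) and the function names are hypothetical.

```python
import math


def weighted_distance(d, p, w):
    """dist(d, p, w): weighted Euclidean distance from record d to centroid p."""
    return math.sqrt(sum(w[k] * (d[k] - p[k]) ** 2 for k in w))


def score_record(record, reached_leaf, neighbor_nodes, eps=1e-9):
    """Return the maximum of conf / dist over the candidate nodes."""
    best = 0.0
    for node in [reached_leaf, *neighbor_nodes]:
        dist = weighted_distance(record, node["centroid"], node["weights"])
        best = max(best, node["conf"] / max(dist, eps))  # eps: assumed guard
    return best


reached = {"conf": 0.9, "centroid": {"x": 0, "y": 0},
           "weights": {"x": 1.0, "y": 1.0}}
neighbor = {"conf": 0.6, "centroid": {"x": 3, "y": 4},
            "weights": {"x": 1.0, "y": 1.0}}
s = score_record({"x": 3, "y": 0}, reached, [neighbor])
```

Here the record lies 3 units from the reached leaf's centroid and 4 units from the neighbor's, so the reached leaf supplies the larger ratio, 0.9 / 3.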

23. The method of claim 1, wherein said method is implemented by a
program storage device readable by machine, tangibly embodying a program of 
instructions executable by the machine to perform said method steps. 



24. A method for building a decision tree from an input data set, the
input data set comprising records and associated attributes, the attributes
including a class label attribute for indicating whether a given record is a member
of a target class or a non-target class, the input data set being biased in favor of
the records of the non-target class, the decision tree comprising a plurality of
nodes that include leaf nodes, said method comprising the steps of:

constructing the decision tree from the input data set, based upon
multivariate subspace splitting criteria;

identifying a nearest neighbor set of nodes for each of the leaf nodes
based upon a respective closeness of the nearest neighbor set of nodes to a
target record of the target class, as respectively measured by distance functions
computed for each of the leaf nodes; and

classifying and scoring the records, based upon the decision tree and the
nearest neighbor set of nodes.

25. The method of claim 24, wherein said method is implemented by a
program storage device readable by machine, tangibly embodying a program of
instructions executable by the machine to perform said method steps.
